--- small RNA 分析流程更新 更新时间: 2025年07月30日 --- # 关于样本的说明 - 本次分析测试样本:小鼠样本 # 统计原始数据测序量 ```sh mkdir -p 00_result # 创建结果目录 mkdir -p 01_report # 原始数据fastqc结果 mkdir -p 999_tempFile # 中间文件目录,其中内容可删除 seqkit stats -T -a -j64 01_seqRaw/*R1.fastq.gz > 00_result/seqStatsRaw.tsv fastqc 01_seqRaw/*R1.fastq.gz -o 01_report -t 64 ``` # 对原始数据进行质控 > 注意:质控使用trim_galore ## 1. AG方法 - 即SMARTer smRNA-Seq Kit for Illumina User Manua说明书中的方法 - 对于R1:去除前3个碱基(G或C),去除序列连续含大于等于10bp polyA/G 及其后的部分 ```shell mkdir -p 02_fastqTrim conda activate trim_galore # 先去除左侧3bp碱基,以及各种接头 #去除N占10%的序列 awk '$2~/AG/ {print $1} ' metadata.tsv | while read id ;do echo -e "${id} 正在trim!!!";trim_galore --length 1 --clip_R1 3 --max_n 0.1 -j 64 -o 02_fastqTrim 01_seqRaw/${id}_R1.fastq.gz;echo -e "${id} 结束trim!!!";done # 去除polyA(如果存在polyA) #--max_length表示超过该长度的reads将被过滤 awk '$2~/AG/ {print $1} ' metadata.tsv | while read id ;do echo -e "${id} 正在trim!!!";trim_galore -a A{10} --length 1 -j 64 -o 02_fastqTrim 02_fastqTrim/${id}_R1_trimmed.fq.gz;echo -e "${id} 结束trim!!!";done # 去除polyG(如果去除polyA后仍存在polyG) #--max_length表示超过该长度的reads将被过滤 # awk '$2~/AG/ {print $1} ' metadata.tsv | while read id ;do echo -e "${id} 正在trim!!!";trim_galore -a G{10} --length 1 -j 64 -o 02_fastqTrim 02_fastqTrim/${id}_R1_trimmed_trimmed.fq.gz;echo -e "${id} 结束trim!!!";done ``` ## 2. 将18bp以下的reads过滤掉,并保留至文件夹1-17nt_fq ```shell mkdir 1-17nt_fq awk '$2~/AG/ {print $1} ' metadata.tsv | while read id ;do cutadapt -m 18 -o 02_fastqTrim/${id}_trimComplete.fq.gz --too-short-output 1-17nt_fq/${id}_R1_discarded_reads.fq.gz 02_fastqTrim/${id}_R1_trimmed_trimmed.fq.gz;done # 若执行了去除polyG,则02_fastqTrim/${id}_R1_trimmed_trimmed.fq.gz改为02_fastqTrim/${id}_R1_trimmed_trimmed_trimmed.fq.gz ``` ## 3. 统计质控后的数据测序量,运行fastqc查看数据过滤情况(接头是否去除干净等): ```shell /home/user/anaconda3/envs/commonKits/bin/seqkit stats -T -a -j64 02_fastqTrim/*trimComplete.fq.gz > 00_result/seqStatsAfterTrim.tsv fastqc 02_fastqTrim/*trimComplete.fq.gz -o 02_fastqTrim -t 64 ``` ## 4. 统计质控前后的留存情况 > [!NOTE] > | sampleNames | reads_raw | reads_afterTrim | reads_retain % | base_raw | base_aftertrim | base_retain % | > |-------------|-----------|-----------------|----------------|-----------|----------------|---------------| > | Nuo | 13902518 | 13732622 | 98.78 | 695125900 | 310071252 | 44.61 | > | sm01 | 10040191 | 8920910 | 88.85 | 502009550 | 374546830 | 74.61 | > | sm02 | 12043942 | 9994010 | 82.98 | 602197100 | 418537730 | 69.50 | > | sm03 | 12750822 | 11626935 | 91.19 | 637541100 | 508661327 | 79.78 | ### R语言实现 ```R # R library(tidyverse) tableRaw <- read_tsv("00_result/seqStatsRaw.tsv") %>% select(1,4,5) %>% set_names(c("sampleName","readsRaw","baseRaw")) %>% mutate(sampleName=basename(sampleName)|> str_remove("_R1\\.(fastq|fq)\\.gz$")) %>% arrange(sampleName) tableTrim <- read_tsv("00_result/seqStatsAfterTrim.tsv") %>% select(1,4,5,14,15,17) %>% set_names(c("sampleName","readsTrim","baseTrim","Q20(%)","Q30(%)","GC(%)")) %>% mutate(sampleName=basename(sampleName)|> str_remove("\\.(fastq|fq)\\.gz$") |> str_remove("_trimComplete$")) %>% arrange(sampleName) left_join(tableRaw,tableTrim,by=join_by(sampleName)) %>% mutate(readsRetain=round(100*readsTrim/readsRaw,2),baseRetain=round(100*baseTrim/baseRaw,2)) %>% select(1,3,5,10,2,4,9,6,7,8) %>% rename(`readsRetain(%)`=readsRetain,`baseRetain(%)`=baseRetain) %>% write_tsv("00_result/seqTrimSummaryStat.tsv") ``` # sRNA长度筛选 ## 得到total sRNA片段的长度(未去重的序列): ```shell for sample in `ls 02_fastqTrim/*_trimComplete.fq.gz | xargs -I {} basename {} | sed 's/_trimComplete.fq.gz//'`;do seqkit fx2tab -j 64 -l -n -i -H 02_fastqTrim/${sample}_trimComplete.fq.gz | cut -f 2 > 02_fastqTrim/${sample}_length.tsv;done ``` ### 将上述total sRNA片段的长度,进行分布统计(R语言): ```R library(tidyverse) tsvFiles <- list.files("02_fastqTrim",pattern="*tsv") %>% str_c("02_fastqTrim/",.) # percent列已经乘以了100 lengthList <- map(tsvFiles,~read_tsv(file=.x) %>% count(length) %>% mutate(percent=100*n/sum(n)) %>% mutate(sampleName=.x) %>% relocate(sampleName,.before=everything())) # 以下为将表格本地化保存,可不做 lengthTable <- lengthList %>% bind_rows() %>% mutate(sampleName=str_replace(sampleName,patter="_length.tsv","")) %>% mutate(sampleName=str_replace(sampleName,pattern="02_fastqTrim/","")) lengthTable %>% write_tsv(file="00_result/lengthTable_sRNA.tsv") ``` ### 可视化绘图(R语言): ```R # R library(tidyverse) # install.packages("ggthemes") # 设置保存路径 savePath <- "./" # 封装的绘图和保存函数 lengthPlotFunc <- function(sampleName) { lengthPlot <- sampleName %>% ggplot(aes(x=length,y=percent)) + geom_col(aes(fill=percent)) + geom_text(aes(y=percent,label=n)) + scale_x_continuous(name = "Length(nt)",breaks = seq(15,max(sampleName$length))) + scale_y_continuous(name = "Fraquence Percent",labels = scales::label_percent(scale = 1)) + ggtitle(str_glue("The Sequence Length Distribution of Sample 『",unique(sampleName$sampleName),"』")) + ggthemes::theme_clean() + theme(plot.title = element_text(hjust = .5),legend.position = "none",axis.text.x = element_text(angle = 50,vjust = 1,hjust = 1)) ggsave(filename = str_glue(savePath,"/",unique(sampleName$sampleName),".png"),plot=lengthPlot,width = unit(20,"mm"),height = unit(10,"mm")) } # 批量本地化保存 walk(lengthList,lengthPlotFunc) ``` ## 去除重复序列(质控后): ```shell mkdir -p 03_rmDuplicate for sample in `ls 02_fastqTrim/*_trimComplete.fq.gz | xargs -I {} basename {} | sed 's/_trimComplete.fq.gz//'`;do echo "${sample} Is Removing Duplicates Now!!!";seqkit rmdup -j64 -s 02_fastqTrim/${sample}_trimComplete.fq.gz -o 03_rmDuplicate/${sample}.fq.gz;done ``` ## 统计去重后的数据量(即可理解为sRNA的种类): ```shell seqkit stats -T -a -j64 03_rmDuplicate/*.fq.gz > 00_result/seqStatsRmDuplicate.tsv ``` # 参考序列比对分析 - **若只分析miRNA,可直接运行【单独分析miRNA小节的代码】** ## 1. 使用bowtie比对参考基因组 > [!NOTE] Title > 需要指定bowtie的位置`/root/miniconda3/envs/smallRNA/bin/bowtie-build` > 命令行输入bowtie-build看是否存在,如果不存在则需指定路径,存在则不用,按以下代码输入 - 构建小鼠的bowtie索引 ```shell bowtie-build --threads 64 /home/user/AG305/__dataBases__/__refGene__/Mus_musculus.GRCm38.dna.primary_assembly/mmu.fa /home/user/AG305/__dataBases__/__refGene__/Mus_musculus.GRCm38.dna.primary_assembly/bowtie2/mmu.fa ``` > 此处线程设置需要特别谨慎,否则容易爆内存 ```sh mkdir -p 04_refGeneMapping conda activate rna-seq cut -f 1 metadata.tsv | tail -n+2 | while read id;do echo ${id} Is Mapping Now!!!;bowtie2 -p 64 --very-sensitive --al 04_refGeneMapping/${id}_ReadsAligned.fq --un 04_refGeneMapping/${id}_ReadsUnaligned.fq -x /home/user/AG305/__dataBases__/__refGene__/Mus_musculus.GRCm38.dna.primary_assembly/bowtie2/mmu.fa 02_fastqTrim/${id}_trimComplete.fq.gz -S 04_refGeneMapping/${id}.sam 2> 04_refGeneMapping/${id}.bowtie2.log;done # sam转bam cut -f 1 metadata.tsv | tail -n+2 | while read id;do samtools view -bF 4 04_refGeneMapping/${id}.sam > 04_refGeneMapping/${id}.bam;done #bam排序 cut -f 1 metadata.tsv | tail -n+2 | while read id;do samtools sort -@ 64 -o 04_refGeneMapping/${id}_sorted.bam 04_refGeneMapping/${id}.bam;done # # 参考基因组注释文件gtf转bed(bed文件已存在,则无需运行) # cat genomic.gtf |sed 's/;//' | awk -F '[\t *]' '/^chr/{if($3=="transcript"){print $1,$4-1,$5,$10,$12,$21,$7,$3}}' OFS="\t" >genomic.bed # 统计mapped reads在参考基因组上的区域分布 cut -f 1 metadata.tsv | tail -n+2 | while read id;do read_distribution.py -i 04_refGeneMapping/${id}_sorted.bam -r /home/user/AG305/__dataBases__/__refGene__/Mus_musculus.GRCm38.dna.primary_assembly/mm10_RefSeq.bed;done ``` ## 比对不上参考基因组的序列来源 - 使用Kraken2&Bracken分析 ```sh conda activate kraken2 # 物种注释 for filePath in $(ls 04_refGeneMapping/*_ReadsUnaligned.fq);do echo ${filePath} 开始分析!!!;kraken2 --db /home/user/AG305/__dataBases__/kraken2 --threads 64 --report ${filePath%_ReadsUnaligned.fq}.kraken2.report --output ${filePath%_ReadsUnaligned.fq}.kraken2.stdout ${filePath};bracken -d /home/user/AG305/__dataBases__/kraken2 -i ${filePath%_ReadsUnaligned.fq}.kraken2.report -r 100 -l S -o ${filePath%_ReadsUnaligned.fq}.bracken.stdout -w ${filePath%_ReadsUnaligned.fq}.bracken.report;echo ${filePath} 结束分析!!!;done # 统计比对率 for i in $(ls 04_refGeneMapping/*.kraken2.report);do echo $i;sed -n -e '1s#\t.\+##;1p' $i;done | sed 's#^ ##g' | sed 's#\.kraken2\.report##g' | sed 's#4_refGeneMapping/##' | sed 'N;s#\n#\t#g' | sed '1i sampleName\tpercent' | csvtk mutate2 -t -n annoPercent -e '100 - $2' | csvtk cut -t -f1,3 | csvtk sort -t -k2:nr | csvtk replace -t -f2 -p "$" -r "%" > 04_refGeneMapping/krakenAnnoPercent.tsv # 提取各样本占比超过1%的物种 for i in $(ls 04_refGeneMapping/*.bracken.stdout);do j=${i#4_refGeneMapping/};k=${j%.bracken.stdout};csvtk sort -k7:nr -t $i | csvtk filter -t -f "fraction_total_reads>0.01" | csvtk cut -t -f1,7 | csvtk rename -t -f1,2 -n taxonomy,percent | csvtk mutate2 -t -n sampleName -e "'$k'" | csvtk cut -t -f3,1,2;done | sed '1p;/^sampleName/d' > 04_refGeneMapping/taxonomy_greaterThanPercent1.tsv ``` ## R语言可视化 ```R # R library(tidyverse) tableRaw <- read_tsv("04_refGeneMapping/taxonomy_greaterThanPercent1.tsv") tableRaw %>% ggplot(aes(sampleName,taxonomy,fill=percent)) + geom_tile() + scale_fill_continuous(label=scales::label_percent()) + ggthemes::theme_clean() + guides(x=guide_axis(n.dodge = 2)) + labs(x=NULL,y=NULL,fill="Percentage") + theme(aspect.ratio = 1) ``` ## 分别统计"比对上"和"未比对上"参考基因组的序列的长度分布 - 序列基本情况 ```sh seqkit stats -T -j64 $(ls 04_refGeneMapping/*Aligned*fq) > 00_result/seqStatsRefGeneAligned.tsv seqkit stats -T -j64 $(ls 04_refGeneMapping/*Unaligned*fq) > 00_result/seqStatsRefGeneUnaligned.tsv ``` - 分别将“比对上”和“未比对上“参考基因组的序列转换为长度表格用于后续分析 ```sh for filePath in $(ls 04_refGeneMapping/*_ReadsAligned.fq);do fileName=$(basename $filePath);echo $fileName Is Counting Now!!!;seqkit fx2tab -j 64 -l -n -i -H $filePath | cut -f 2 > 04_refGeneMapping/${fileName%_ReadsAligned.fq}_lengthAligned.tsv;done for filePath in $(ls 04_refGeneMapping/*_ReadsUnaligned.fq);do fileName=$(basename $filePath);echo $fileName Is Counting Now!!!;seqkit fx2tab -j 64 -l -n -i -H $filePath | cut -f 2 > 04_refGeneMapping/${fileName%_ReadsUnaligned.fq}_lengthUnaligned.tsv;done mkdir -p 00_result/lengthDistribution_refGeneMapping ``` - 可视化(R语言绘图) ```R # R # 对上述序列长度表格,进行分布统计(R语言) library(tidyverse) tsvFilesAligned <- list.files("04_refGeneMapping",pattern="*_lengthAligned.tsv") %>% str_c("04_refGeneMapping/",.) tsvFilesUnaligned <- list.files("04_refGeneMapping",pattern="*_lengthUnaligned.tsv") %>% str_c("04_refGeneMapping/",.) # percent列已经乘以了100 lengthListAligned <- map(tsvFilesAligned,~read_tsv(file=.x) %>% count(length) %>% mutate(percent=100*n/sum(n)) %>% mutate(sampleName=.x) %>% relocate(sampleName,.before=everything())) lengthListUnaligned <- map(tsvFilesUnaligned,~read_tsv(file=.x) %>% count(length) %>% mutate(percent=100*n/sum(n)) %>% mutate(sampleName=.x) %>% relocate(sampleName,.before=everything())) # 设置保存路径 savePath <- "00_result/lengthDistribution_refGeneMapping" # 封装的绘图和保存函数 lengthPlotFunc <- function(sampleName,titleSuffix) { lengthPlot <- sampleName %>% ggplot(aes(x=length,y=percent)) + geom_col(aes(fill=percent)) + geom_text(aes(y=percent,label=n)) + scale_x_continuous(name = "Length(nt)",breaks = seq(15,max(sampleName$length))) + scale_y_continuous(name = "Fraquence Percent",labels = scales::label_percent(scale = 1)) + ggtitle(str_glue("The Sequence Length Distribution of Sample 『",str_replace(unique(sampleName$sampleName),pattern="4_refGeneMapping/","") %>% str_replace(.,pattern="_lengthAligned.tsv|_lengthUnaligned.tsv",""),"』 ",titleSuffix)) + ggthemes::theme_clean() + theme(plot.title = element_text(hjust = .5),legend.position = "none",axis.text.x = element_text(angle = 50,vjust = 1,hjust = 1)) ggsave(filename = str_glue(savePath,"/",str_replace(unique(sampleName$sampleName),pattern="04_refGeneMapping/","") %>% str_replace(.,pattern="_lengthAligned.tsv|_lengthUnaligned.tsv",""),"_",titleSuffix,".png"),plot=lengthPlot,width = unit(20,"mm"),height = unit(10,"mm")) } # 批量本地化保存 walk(lengthListAligned,~lengthPlotFunc(.x,"Aligned")) walk(lengthListUnaligned,~lengthPlotFunc(.x,"Unaligned")) ``` ```sh mkdir -p 04_refGeneMapping/miRNA_mature # 比对成熟体 cut -f 1 metadata.tsv | tail -n+2 | while read id;do echo ${id} Mature Is Mapping Now!!!;bowtie --threads 64 -v 0 -m 50 -a --best --strata --al 04_refGeneMapping/miRNA_mature/${id}_matureAligned.fq --un 04_refGeneMapping/miRNA_mature/${id}_matureUnaligned.fq -x /home/user/AG305/__dataBases__/smallRNA/mmu_mature_DNA.fa 04_refGeneMapping/${id}_ReadsAligned.fq 04_refGeneMapping/miRNA_mature/${id}_mature.bwt 2> 04_refGeneMapping/miRNA_mature/${id}_mature.log;done ``` # ncRNA分析 ## 1. 分析 - 注意:这里使用的是**未比对上miRNA成熟体数据库的序列**!!!! - 将fastq转换为fasta格式 ```sh mkdir -p 06_ncRNA conda activate smallRNA cut -f 1 metadata.tsv | tail -n+2 | while read id;do seqkit fq2fa -j 64 -w 0 04_refGeneMapping/miRNA_mature/${id}_matureUnaligned.fq -o 06_ncRNA/${id}.fa;done ``` - 计算Z值并变成表格形式 ```sh # 统计未匹配上miRNA成熟体数据库的序列情况 seqkit stats -T 04_refGeneMapping/miRNA_mature/*_matureUnaligned.fq > seqStatsMatureUnaligned.tsv # 变为Z值表 cat seqStatsMatureUnaligned.tsv | awk '{print $1"\t"$5}' | sed -e 's#04_refGeneMapping/miRNA_mature/##' -e 's#_matureUnaligned.fq##' -e 's#,##g' | csvtk mutate2 -w0 -t -n Z -e '8216*${sum_len}/1000000' > 06_ncRNA/Rfam_zScore.tsv ``` - 运行开始 > [!warning] > 该命令需要执行很长时间,使用`tmux`来进行运行 ```sh # 尝试使用这种代码方式 # -j 5 限制并行数 parallel -j 5 --xapply "echo '###{1}分析开始### ';date;cmscan -Z {2} --cut_ga --rfam --nohmmonly --fmt 2 --tblout 06_ncRNA/{1}_rfam.tblout -o 06_ncRNA/{1}_rfam.result --clanin /home/user/AG305/__dataBases__/Rfam/Rfam.clanin /home/user/AG305/__dataBases__/Rfam/Rfam.cm 06_ncRNA/{1}.fa;echo '###{1}分析结束!!!### ';date" ::: $(cat 06_ncRNA/Rfam_zScore.tsv | tail -n+2 | cut -f1) ::: $(cat 06_ncRNA/Rfam_zScore.tsv | tail -n+2 | cut -f3) ``` - 后续统计 ```sh # 统计各类ncRNA总数 ## 整理注释结果文件*.tblout,提取必需的列,非重叠区域或重叠区域得分高的区域 for i in $(ls 06_ncRNA/*.tblout);do awk 'BEGIN{OFS="\t";}{if(FNR==1) print "target_name\taccession\tquery_name\tquery_start\tquery_end\tstrand\tscore\tEvalue"; if(FNR>2 && $20!="=" && $0!~/^#/) print $2,$3,$4,$10,$11,$12,$17,$18; }' $i > ${i}.xls;done ## TODO xls文件中的Evalue需要进行筛选!!! # 统计ncRNA注释结果 for i in $(ls 06_ncRNA/*.tblout.xls);do awk 'BEGIN{OFS=FS="\t"}ARGIND==1{a[$2]=$5;}ARGIND==2{type=a[$1]; if(type=="") type="Others"; count[type]+=1;}END{for(type in count) print type, count[type];}' /home/user/AG305/__dataBases__/Rfam/Rfam_anno.tsv $i > ${i%.tblout.xls}.ncRNA.statistic;done ``` - 将各样本结果整理在一起 ```sh cut -f1 metadata.tsv | tail -n+2 | while read id;do sed '1i ncRNA\t'$id'' 06_ncRNA/${id}_rfam.ncRNA.statistic > 999_tempFile/${id}_rfam.ncRNA_addTitle.statistic; done # 要排除鉴定出来的miRNA csvtk join --na 0 --outer-join -f1 -t 999_tempFile/*_rfam.ncRNA_addTitle.statistic | csvtk filter2 -t -f '$ncRNA != "miRNA"'> 00_result/ncRNA_statSummary.tsv ``` - ![[small RNA 分析 20231026-20231109091302.png|500]] - 合并miRNA与ncRNA结果,生成总表 ```R # /home/user/anaconda3/envs/r_fixed/bin/R # 加载必要的包 library(tidyverse) # 读取 mirnaMappingSummary 文件并重命名列 mirna_data <- read_tsv("00_result/mirnaMappingSummary.tsv") %>% select(-Mapped_total_sRNA, -Mapped_uniq_sRNA) %>% # 移除指定列 rename( Mapped_hairpin_type = Mapped_hairpin, Mapped_mature_type = Mapped_mature ) # 读取 ncRNA_statSummary 文件并转换为长格式 ncRNA_long <- read_tsv("00_result/ncRNA_statSummary.tsv") %>% pivot_longer( cols = -ncRNA, names_to = "sampleName", values_to = "count" ) # 将 ncRNA 计数转换为宽格式(每个 ncRNA 类型作为单独列) ncRNA_wide <- ncRNA_long %>% pivot_wider( names_from = ncRNA, values_from = count, names_glue = "{ncRNA}_count" # 添加后缀 ) # 按样本名合并两个数据集 combined_data <- inner_join(mirna_data, ncRNA_wide, by = "sampleName") # 查看合并后的数据 print(combined_data) # 保存结果到文件 write_tsv(combined_data, "00_result/miRNA_ncRNA_SummaryCombined.tsv") ``` # piRNA鉴定 > [!NOTE] > - 在[piRBase (ibp.ac.cn)](http://bigdata.ibp.ac.cn/piRBase/download.php)下载人类和小鼠的成熟体piRNA数据库 > - 没有找到番茄的数据库 > - 没有找到前体的数据库 - 构建bowtie索引 ```sh /root/miniconda3/envs/smallRNA/bin/bowtie-build --threads 64 mmu.fa.gz mmu.fa.gz /root/miniconda3/envs/smallRNA/bin/bowtie-build --threads 64 hsa.fa.gz hsa.fa.gz ``` - 这里**暂且直接使用**未比对上miRNA成熟体的序列进行鉴定 ```sh mkdir -p 12_piRNA cut -f1 metadata.tsv | tail -n+2 | while read id;do echo ${id} piRNA Is Mapping Now!!!;/root/miniconda3/envs/smallRNA/bin/bowtie --threads 64 -v 0 -m 50 -a --best --strata --al 12_piRNA/${id}_piRnaAligned.fq --un 12_piRNA/${id}_piRnaUnaligned.fq -x /home/user/AG305/__dataBases__/piRBase/mmu.fa.gz 04_refGeneMapping/miRNA_mature/${id}_matureUnaligned.fq 12_piRNA/${id}_piRna.bwt 2> 12_piRNA/${id}_piRna.log;done ``` - 这里有一个问题,会出现一条reads对应多个piRNA的情况,以下是去除多重比对之后和以前的结果数,可以发现去除多重比对后结果少了一个数量级![[small RNA 分析 20231026-20231108091836.png|500]] - 因此决定按照去重后的结果进行统计 ```sh # 统计种类和数目 cut -f1 metadata.tsv | tail -n+2 | while read id ;do echo $id 正在处理piRNA;csvtk uniq -t -f1 12_piRNA/${id}_piRna.bwt | csvtk cut -t -f3 | sort | uniq -c > 12_piRNA/${id}_piRnaCountRaw.tsv;done # 表格排序及加标题 cut -f1 metadata.tsv | tail -n+2 | while read id ;do awk '{print $2"\t"$1}' 12_piRNA/${id}_piRnaCountRaw.tsv | sort -t$'\t' -k2nr | csvtk add-header -t -n "piRNA,${id}" > 12_piRNA/${id}_piRnaCount.tsv;done ``` - 分别统计各样本reads数大于`1`、`10`、`100`的数量 ```R library(tidyverse) piRnaCountFiles <- list.files(path="12_piRNA/", pattern="*_piRnaCount.tsv", full.names=T) piRnaCountList <- map(piRnaCountFiles, read_tsv) piRnaCountTableJoin <- reduce(piRnaCountList, full_join) piRnaCountTableJoinLonger <- piRnaCountTableJoin %>% pivot_longer(-piRNA, names_to="sampleName", values_to="count") countGt1 <- piRnaCountTableJoinLonger %>% filter(count>1) %>% group_by(sampleName) %>% summarise(countGt1=n()) countGt10 <- piRnaCountTableJoinLonger %>% filter(count>10) %>% group_by(sampleName) %>% summarise(countGt10=n()) countGt100 <- piRnaCountTableJoinLonger %>% filter(count>100) %>% group_by(sampleName) %>% summarise(countGt100=n()) countTableJoin <- reduce(list(countGt1,countGt10,countGt100),left_join) countTableJoin %>% write_tsv(file="00_result/piRNA_statSummary.tsv") ``` - ![[small RNA 分析 20231026-20231109114531.png|500]] # 单独分析miRNA ## 1. 筛选18-32nt的序列 ```shell ##对质控后18-32nt的数据分别进行提取(PE150测序,最长为150) mkdir -p 18_32nt_fq cut -f1 metadata.tsv | tail -n+2 | while read id ;do seqkit seq -m 18 -M 32 02_fastqTrim/${id}_trimComplete.fq.gz| gzip > 18_32nt_fq/${id}_18-32nt_R1.fq.gz ;done ``` ## 2. 使用bowtie比对参考基因组 ```shell mkdir -p 04_18-32nt_refGeneMapping conda activate rna-seq cut -f 1 metadata.tsv | tail -n+2 | while read id;do echo ${id} Is Mapping Now!!!;bowtie2 -p 64 --very-sensitive --al 04_18-32nt_refGeneMapping/${id}_ReadsAligned.fq --un 04_18-32nt_refGeneMapping/${id}_ReadsUnaligned.fq -x /home/user/AG305/__dataBases__/__refGene__/Mus_musculus.GRCm38.dna.primary_assembly/bowtie2/mmu.fa 18_32nt_fq/${id}_18-32nt_R1.fq.gz -S 04_18-32nt_refGeneMapping/${id}.sam 2> 04_18-32nt_refGeneMapping/${id}.bowtie2.log;done # sam转bam cut -f 1 metadata.tsv | tail -n+2 | while read id;do samtools view -bF 4 04_18-32nt_refGeneMapping/${id}.sam > 04_18-32nt_refGeneMapping/${id}.bam;done #bam排序 cut -f 1 metadata.tsv | tail -n+2 | while read id;do samtools sort -@ 64 -o 04_18-32nt_refGeneMapping/${id}_sorted.bam 04_18-32nt_refGeneMapping/${id}.bam;done # # 参考基因组注释文件gtf转bed(bed文件已存在,则无需运行) # cat genomic.gtf |sed 's/;//' | awk -F '[\t *]' '/^chr/{if($3=="transcript"){print $1,$4-1,$5,$10,$12,$21,$7,$3}}' OFS="\t" >genomic.bed # 统计mapped reads在参考基因组上的区域分布 cut -f 1 metadata.tsv | tail -n+2 | while read id;do read_distribution.py -i 04_18-32nt_refGeneMapping/${id}_sorted.bam -r /home/user/AG305/__dataBases__/__refGene__/Mus_musculus.GRCm38.dna.primary_assembly/mm10_RefSeq.bed;done ``` ## 2. 比对不上参考基因组的序列来源 使用Kraken2&Bracken分析 ```shell conda activate kraken2 # 物种注释 for filePath in $(ls 04_18-32nt_refGeneMapping/*_ReadsUnaligned.fq);do echo ${filePath} 开始分析!!!;kraken2 --db /home/user/AG305/__dataBases__/kraken2 --threads 64 --report ${filePath%_ReadsUnaligned.fq}.kraken2.report --output ${filePath%_ReadsUnaligned.fq}.kraken2.stdout ${filePath};bracken -d /home/user/AG305/__dataBases__/kraken2 -i ${filePath%_ReadsUnaligned.fq}.kraken2.report -r 25 -t 10 -l S -o ${filePath%_ReadsUnaligned.fq}.bracken.stdout -w ${filePath%_ReadsUnaligned.fq}.bracken.report;echo ${filePath} 结束分析!!!;done # 统计比对率 for i in $(ls 04_18-32nt_refGeneMapping/*.kraken2.report);do echo $i;sed -n -e '1s#\t.\+##;1p' $i;done | sed 's#^ ##g' | sed 's#\.kraken2\.report##g' | sed 's#4_refGeneMapping_miRNA/##' | sed 'N;s#\n#\t#g' | sed '1i sampleName\tpercent' | csvtk mutate2 -t -n annoPercent -e '100 - $2' | csvtk cut -t -f1,3 | csvtk sort -t -k2:nr | csvtk replace -t -f2 -p "$" -r "%" > 04_18-32nt_refGeneMapping/krakenAnnoPercent.tsv # 提取各样本占比超过1%的物种 for i in $(ls 04_18-32nt_refGeneMapping/*.bracken.stdout);do j=${i#4_refGeneMapping_miRNA/};k=${j%.bracken.stdout};csvtk sort -k7:nr -t $i | csvtk filter -t -f "fraction_total_reads>0.01" | csvtk cut -t -f1,7 | csvtk rename -t -f1,2 -n taxonomy,percent | csvtk mutate2 -t -n sampleName -e "'$k'" | csvtk cut -t -f3,1,2;done | sed '1p;/^sampleName/d' > 04_18-32nt_refGeneMapping/taxonomy_greaterThanPercent1.tsv ``` ## R语言可视化 ```shell # R library(tidyverse) tableRaw <- read_tsv("04_18-32nt_refGeneMapping/taxonomy_greaterThanPercent1.tsv") tableRaw %>% ggplot(aes(sampleName,taxonomy,fill=percent)) + geom_tile() + scale_fill_continuous(label=scales::label_percent()) + ggthemes::theme_clean() + guides(x=guide_axis(n.dodge = 2)) + labs(x=NULL,y=NULL,fill="Percentage") + theme(aspect.ratio = 1) ``` ## 分别统计"比对上"和"未比对上"参考基因组的序列的长度分布 - 序列基本情况 ```sh seqkit stats -T -j64 $(ls 04_18-32nt_refGeneMapping/*Aligned*fq) > 00_result/18-32nt_miRNA_seqStatsRefGeneAligned.tsv seqkit stats -T -j64 $(ls 04_18-32nt_refGeneMapping/*Unaligned*fq) > 00_result/18-32nt_miRNA_seqStatsRefGeneUnaligned.tsv ``` - 分别将“比对上”和“未比对上“参考基因组的序列转换为长度表格用于后续分析 ```sh for filePath in $(ls 04_18-32nt_refGeneMapping/*_ReadsAligned.fq);do fileName=$(basename $filePath);echo $fileName Is Counting Now!!!;seqkit fx2tab -j 64 -l -n -i -H $filePath | cut -f 2 > 04_18-32nt_refGeneMapping/${fileName%_ReadsAligned.fq}_lengthAligned.tsv;done for filePath in $(ls 04_18-32nt_refGeneMapping/*_ReadsUnaligned.fq);do fileName=$(basename $filePath);echo $fileName Is Counting Now!!!;seqkit fx2tab -j 64 -l -n -i -H $filePath | cut -f 2 > 04_18-32nt_refGeneMapping/${fileName%_ReadsUnaligned.fq}_lengthUnaligned.tsv;done mkdir -p 00_result/18-32nt_miRNA_lengthDistribution_refGeneMapping ``` - 可视化(R语言绘图) ```R # R # 对上述序列长度表格,进行分布统计(R语言) library(tidyverse) tsvFilesAligned <- list.files("04_18-32nt_refGeneMapping",pattern="*_lengthAligned.tsv") %>% str_c("04_18-32nt_refGeneMapping/",.) tsvFilesUnaligned <- list.files("04_18-32nt_refGeneMapping",pattern="*_lengthUnaligned.tsv") %>% str_c("04_18-32nt_refGeneMapping/",.) # percent列已经乘以了100 lengthListAligned <- map(tsvFilesAligned,~read_tsv(file=.x) %>% count(length) %>% mutate(percent=100*n/sum(n)) %>% mutate(sampleName=.x) %>% relocate(sampleName,.before=everything())) lengthListUnaligned <- map(tsvFilesUnaligned,~read_tsv(file=.x) %>% count(length) %>% mutate(percent=100*n/sum(n)) %>% mutate(sampleName=.x) %>% relocate(sampleName,.before=everything())) # 设置保存路径 savePath <- "00_result/18-32nt_miRNA_lengthDistribution_refGeneMapping" # 封装的绘图和保存函数 lengthPlotFunc <- function(sampleName,titleSuffix) { lengthPlot <- sampleName %>% ggplot(aes(x=length,y=percent)) + geom_col(aes(fill=percent)) + geom_text(aes(y=percent,label=n)) + scale_x_continuous(name = "Length(nt)",breaks = seq(15,max(sampleName$length))) + scale_y_continuous(name = "Fraquence Percent",labels = scales::label_percent(scale = 1)) + ggtitle(str_glue("The Sequence Length Distribution of Sample 『",str_replace(unique(sampleName$sampleName),pattern="4_refGeneMapping_miRNA/","") %>% str_replace(.,pattern="_lengthAligned.tsv|_lengthUnaligned.tsv",""),"』 ",titleSuffix)) + ggthemes::theme_clean() + theme(plot.title = element_text(hjust = .5),legend.position = "none",axis.text.x = element_text(angle = 50,vjust = 1,hjust = 1)) ggsave(filename = str_glue(savePath,"/",str_replace(unique(sampleName$sampleName),pattern="04_18-32nt_refGeneMapping/","") %>% str_replace(.,pattern="_lengthAligned.tsv|_lengthUnaligned.tsv",""),"_",titleSuffix,".png"),plot=lengthPlot,width = unit(20,"mm"),height = unit(10,"mm")) } # 批量本地化保存 walk(lengthListAligned,~lengthPlotFunc(.x,"Aligned")) walk(lengthListUnaligned,~lengthPlotFunc(.x,"Unaligned")) ``` ## 已知miRNA分析 ```shell mkdir -p 05_miRnaMapping/hairpin 05_miRnaMapping/mature # 比对前体 cut -f 1 metadata.tsv | tail -n+2 | while read id;do echo ${id} Hairpin Is Mapping Now!!!;bowtie --threads 64 -v 0 -m 50 -a --best --strata --al 05_miRnaMapping/hairpin/${id}_hairpinAligned.fq --un 05_miRnaMapping/hairpin/${id}_hairpinUnaligned.fq -x /home/user/AG305/__dataBases__/smallRNA/mmu_hairpin_DNA.fa 04_18-32nt_refGeneMapping/${id}_ReadsAligned.fq 05_miRnaMapping/hairpin/${id}_hairpin.bwt 2> 05_miRnaMapping/hairpin/${id}_hairpin.log;done # 比对成熟体 cut -f 1 metadata.tsv | tail -n+2 | while read id;do echo ${id} Mature Is Mapping Now!!!;bowtie --threads 64 -v 0 -m 50 -a --best --strata --al 05_miRnaMapping/mature/${id}_matureAligned.fq --un 05_miRnaMapping/mature/${id}_matureUnaligned.fq -x /home/user/AG305/__dataBases__/smallRNA/mmu_mature_DNA.fa 04_18-32nt_refGeneMapping/${id}_ReadsAligned.fq 05_miRnaMapping/mature/${id}_mature.bwt 2> 05_miRnaMapping/mature/${id}_mature.log;done ``` ## 1. miRNA数量统计 - 统计各样本匹配的各个miRNA前体和成熟体的数量 ```sh for i in $(ls 05_miRnaMapping/hairpin/*_hairpin.bwt);do fileName=$(basename $i);cut -f 3 $i | sort | uniq -c | awk '{print $2"\t"$1}' > 05_miRnaMapping/hairpin/${fileName%_hairpin.bwt}_mirnaHairpinTypeAndCount.tsv;done for i in $(ls 05_miRnaMapping/mature/*_mature.bwt);do fileName=$(basename $i);cut -f 3 $i | sort | uniq -c | awk '{print $2"\t"$1}' > 05_miRnaMapping/mature/${fileName%_mature.bwt}_mirnaMatureTypeAndCount.tsv;done ``` - # 合并成熟体表格(R语言),得到匹配到成熟体的sRNA的个数总表(readcount) ```R # R library(tidyverse) filePath<-list.files(path="05_miRnaMapping/mature/",pattern="*_mirnaMatureTypeAndCount.tsv",full.names=T) fileList <- map(filePath, ~read_tsv(.x,col_names=c("miRNA",str_replace(basename(.x),"_mirnaMatureTypeAndCount.tsv","")))) tableJoin <- reduce(fileList,~full_join(.x,.y,by=join_by(miRNA))) write_tsv(tableJoin,file="00_result/18-32nt_miRnaCounts.tsv") ``` - 统计 ```R # Mapped total sRNA seqkit stats -j64 -b 05_miRnaMapping/hairpin/*_hairpinAligned.fq | awk '{print $1"\t"$4}' | sed '1s/num_seqs/Mapped_total_sRNA_count/g' | sed 's/,//g' | sed 's/_hairpinAligned.fq//g' > 999_tempFile/18-32nt_Mapped_total_sRNA.tsv # Mapped uniq sRNA for i in $(ls 05_miRnaMapping/hairpin/*_hairpinAligned.fq);do seqkit rmdup -j64 -s $i -o ${i%.fq}_rmdup.fq;done seqkit stats -j64 -b 05_miRnaMapping/hairpin/*_rmdup.fq | awk '{print $1"\t"$4}' | sed '1s/num_seqs/Mapped_uniq_sRNA_count/g' | sed 's/,//g' | sed 's/_hairpinAligned_rmdup.fq//g' > 999_tempFile/18-32nt_Mapped_uniq_sRNA.tsv # Mapped hairpin wc -l 05_miRnaMapping/hairpin/*_mirnaHairpinTypeAndCount.tsv | awk '{print $2"\t"$1}' | sed -e 's#05_miRnaMapping/hairpin/##g' -e 's/_mirnaHairpinTypeAndCount.tsv//g' -e '1i file\tMapped_hairpin_type' > 999_tempFile/18-32nt_Mapped_hairpin.tsv # Mapped mature wc -l 05_miRnaMapping/mature/*_mirnaMatureTypeAndCount.tsv | awk '{print $2"\t"$1}' | sed -e 's#05_miRnaMapping/mature/##g' -e 's/_mirnaMatureTypeAndCount.tsv//g' -e '1i file\tMapped_mature_type' > 999_tempFile/18-32nt_Mapped_mature.tsv # 将上述表格合并 paste 999_tempFile/*tsv | cut -f 1,2,4,6,8 | sed '1s/file/sampleName/' > 00_result/18-32ntmirnaMappingSummary.tsv ```